ABSTRACT


In  this  paper  we  formulate  and  obtain  optimal  gambling  strategies  for 
certain  gambling  models.  We  do  this  by  setting  these  models  within 
the  framework  of  dynamic  programming  (also  referred  to  as  Markovian 
decision  processes)  and  then  utilize  results  in  this  field. 

In Section 2 we present some dynamic programming results. In particular
we review and expand upon two of the main results in dynamic
programming. Loosely put, these results are:

(i)  In  problems  in  which  one  is  interested  in  maximizing 
nonnegative  rewards,  a  policy  is  optimal  if  and  only  if 
its  expected  return  satisfies  the  optimality  equation,  and 

(ii)  in  problems  in  which  one  is  interested  in  minimizing 
nonnegative  costs,  the  policy  determined  by  the  optimality 
equation  is  optimal. 

In  Section  3  we  show  how  the  results  of  Section  2  may  be  applied  in 
some  simple  gambling  models.  In  particular  we  consider  the  situation 
where  an  individual  may  bet  any  integral  amount  not  greater  than  his 
fortune  and  he  will  win  this  amount  with  probability  p  or  lose  it 
with probability 1 - p. It is shown that if p ≥ 1/2 then the
timid strategy (always bet 1 dollar) both maximizes the probability
of  ever  reaching  any  preassigned  fortune,  and  also  stochastically 
maximizes  the  time  until  the  bettor  becomes  broke.  Also,  if  p  <  1/2 
then  the  timid  strategy  while  not  stochastically  maximizing  the 
playing  time  does  maximize  the  expected  playing  time. 

In Section 4 we consider the same model but with the additional structure
that the bettor need not gamble but may instead elect to work
for  some  period  of  time.  His  goal  is  to  minimize  the  expected  time 
until  his  fortune  reaches  some  preassigned  goal.  We  show  that  if 
p  <  1/2  then  (i)  always  working  is  optimal,  and  (ii)  among  those 
strategies  that  only  allow  working  when  the  bettor  is  broke  it  is  the 
bold  strategy  that  is  optimal. 

In  Section  5  we  return  to  the  general  dynamic  programming  model  and 
consider  the  problem  of  determining  "good"  subclasses  of  policies. 

Two  counterexamples  are  presented. 
DYNAMIC PROGRAMMING AND GAMBLING MODELS


Sheldon  M.  Ross 


1.  INTRODUCTION  AND  SUMMARY 


In this paper we formulate and obtain optimal gambling strategies for certain
gambling models. We do this by setting these models within the framework of
dynamic programming (also referred to as Markovian decision processes) and then
utilize results in this field.

In Section 2 we present some dynamic programming results. In particular we
review and expand upon two of the main results in dynamic programming. Loosely
put, these results are:

(i) In problems in which one is interested in maximizing nonnegative
rewards, a policy is optimal if and only if its expected return satisfies
the optimality equation, and

(ii) in problems in which one is interested in minimizing nonnegative costs,
the policy determined by the optimality equation is optimal.

In Section 3 we show how the results of Section 2 may be applied in some
simple gambling models. In particular we consider the situation where an
individual may bet any integral amount not greater than his fortune and he will
win this amount with probability p or lose it with probability 1 - p. It is
shown that if p ≥ 1/2 then the timid strategy (always bet 1 dollar) both maximizes
the probability of ever reaching any preassigned fortune, and also stochastically
maximizes the time until the bettor becomes broke. Also, if p < 1/2 then the
timid strategy, while not stochastically maximizing the playing time, does maximize
the expected playing time.
In Section 4 we consider the same model but with the additional structure
that the bettor need not gamble but may instead elect to work for some period of
time. His goal is to minimize the expected time until his fortune reaches some
preassigned goal. We show that if p < 1/2 then (i) always working is optimal,
and (ii) among those strategies that only allow working when the bettor is broke
it is the bold strategy that is optimal.

In Section 5 we return to the general dynamic programming model and consider
the problem of determining "good" subclasses of policies. Two counterexamples
are presented.


2.  SOME  DYNAMIC  PROGRAMMING  PRELIMINARIES 


Consider a process that is observed at discrete time points to be in one
of the possible states 0, 1, 2, ... . After observing the state, one of a finite
number of possible actions must be chosen. If the process is in state i and
action a is chosen then (i) we obtain a reward R(i,a); and (ii) the next state
of the process is chosen according to the Markov transition probabilities
$\{P_{ij}(a),\ i, j \ge 0\}$.


If $R(i,a) \ge 0$ for all i, a, then we say that we are in the positive case.
A policy is any rule for choosing actions, and for each policy f we define
$V_f(i)$ to be the total expected reward earned if the initial state is i and
policy f is employed. Also, let $V(i) = \sup_f V_f(i)$. The following equation,
known as the optimality equation, is easily established in the positive case
(see, for example, page 121 of [11]).

(1)   $V(i) = \max_a \Big[ R(i,a) + \sum_j P_{ij}(a) V(j) \Big]$


The  major  result  about  the  positive  case  that  we  shall  use  is  the  following. 


Proposition 1:

Assume $R(i,a) \ge 0$. The policy f is optimal, i.e., $V_f(i) = V(i)$ for all i,
if and only if its return function $V_f(i)$ satisfies the optimality equation.

This proposition was originally proven by Blackwell [1] in a more general
setting than the above. A simple proof is as follows.


Proof of Proposition 1:

If f is optimal then $V_f = V$ satisfies the optimality equation, so only the
converse needs proof. If $V_f$ satisfies the optimality equation then it follows
that using f is at least as good (in the expected reward sense) as using any
other policy for exactly one stage and then switching to f. But repeating this
argument after the first stage shows that f is at least as good as using any
other policy for exactly 2 stages and then switching to f. By induction, it
follows that f is at least as good as using any policy for n stages and then
switching to f. But as $R(i,a) \ge 0$ it follows that the expected return
obtained from time n + 1 onward is nonnegative. Hence, the expected return from
f is at least as large as the n-stage return from any other policy. The result
now follows by letting $n \to \infty$.

Q.E.D.

Remark:

The above proof shows that it is not necessary to require that $R(i,a) \ge 0$.
A weaker sufficient condition would be that $V_f(i) \ge 0$. In fact an even weaker
sufficient condition would be that

$\liminf_{n} E_g[V_f(X_n) \mid X_0 = i] \ge 0$ for all i and all policies g,

where $X_n$ is the state of the process at time n.


If $R(i,a) \le 0$ for all i, a, then we say that we are in the negative
case. In this case it is more natural to define $C(i,a) = -R(i,a)$, so as to
minimize nonnegative costs as opposed to maximizing nonpositive rewards. Letting
$V_*(i)$ denote the infimum, over all policies, of the total expected cost
incurred when the initial state is i, it is again easy to establish the
optimality equation, which now takes the following form.

(2)   $V_*(i) = \min_a \Big[ C(i,a) + \sum_j P_{ij}(a) V_*(j) \Big]$


The following proposition was originally proven by Strauch [12].

Proposition 2:

Assume $C(i,a) \ge 0$, and let f be a policy which, when the process is in
state i, selects an action minimizing the right-hand side of the optimality
equation (2). Then f is optimal.


Proof:

Proposition 2 is proven by noting that if f is determined by the optimality
equation (2) then we can get within $\varepsilon/2$ of $V_*$ if we use f for
exactly one stage and then switch to a policy within $\varepsilon/2$ of the
optimal. Repeating this argument n times shows that we can get within
$\frac{\varepsilon}{2} + \frac{\varepsilon}{2^2} + \cdots + \frac{\varepsilon}{2^n}$
of $V_*$ if we use f for n stages and then switch to a policy within
$\varepsilon/2$ of the optimal. However, as costs are nonnegative it thus follows
that the n-stage cost under f is smaller than $V_* + \varepsilon$ and, as
$\varepsilon$ is arbitrary, the result follows by letting $n \to \infty$.

Q.E.D.

Remark:

The above proof shows that it is not necessary to require that $C(i,a) \ge 0$.
It is sufficient for $V_*(i) \ge 0$; and a weaker sufficient condition would be for

$\liminf_{n} E_f[V_*(X_n) \mid X_0 = i] \ge 0$ for all i.


Unfortunately, Proposition 1 is not true in the negative case; nor is
Proposition 2 in the positive. A simple counterexample to Proposition 2 in the
positive case, which is due to Strauch [12], is the following: The states are given
by the positive integers, and when in state i we have the choice of either
accepting a terminal reward 1 - 1/i or else receiving no reward and going to state
i + 1. Clearly an optimal policy does not exist and hence Proposition 2 could
not be valid. From the remark following the proof of Proposition 2 it does, however,
follow in the positive case that if f is chosen by the optimality equation (1)
then f is optimal if

$E_f[V(X_n) \mid X_0 = i] \to 0$ as $n \to \infty$.

The following counterexample shows that Proposition 1 is not necessarily
true in the negative case.

Counterexample:

There are two states and two actions.

$C(1,1) = 0$,  $C(1,2) = 1$,  $C(2,i) = 0$,  i = 1, 2

$P_{1,1}(1) = 1$,  $P_{1,2}(2) = 1$,  $P_{2,2}(i) = 1$,  i = 1, 2

Let f be the policy that always chooses action 2. Then $V_f(1) = 1$, $V_f(2) = 0$
and

$V_f(1) \le C(1,1) + V_f(1)$

$V_f(2) \le C(2,1) + V_f(2)$.

Hence $V_f$ satisfies the optimality equation but f is obviously not optimal.
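
The counterexample can be checked mechanically. The following Python sketch encodes the costs and the (deterministic) transitions listed above; the names `COST`, `NEXT`, `total_cost` and `satisfies_optimality_equation` are illustrative choices and do not appear in the original text. The 50-step horizon is an arbitrary truncation, harmless here because each of the two policies stops incurring cost after at most one step.

```python
# Costs C(state, action) and deterministic next states of the counterexample.
COST = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 0.0}
NEXT = {(1, 1): 1, (1, 2): 2, (2, 1): 2, (2, 2): 2}
STATES = (1, 2)
ACTIONS = (1, 2)


def total_cost(policy, start, horizon=50):
    """Total cost of a stationary policy over `horizon` steps (illustrative helper)."""
    state, cost = start, 0.0
    for _ in range(horizon):
        action = policy[state]
        cost += COST[(state, action)]
        state = NEXT[(state, action)]
    return cost


def satisfies_optimality_equation(values):
    """Check values[i] == min_a (C(i,a) + values[next state]) in every state."""
    return all(
        values[s] == min(COST[(s, a)] + values[NEXT[(s, a)]] for a in ACTIONS)
        for s in STATES
    )


f = {1: 2, 2: 2}  # the policy of the counterexample: always action 2
g = {1: 1, 2: 1}  # a competing policy: always action 1

V_f = {s: total_cost(f, s) for s in STATES}
V_g = {s: total_cost(g, s) for s in STATES}

print("V_f =", V_f, "satisfies (2):", satisfies_optimality_equation(V_f))
print("V_g =", V_g)
```

The return of f solves equation (2), yet the policy g does strictly better from state 1, which is exactly the failure of Proposition 1 in the negative case.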

One  sufficient  condition  under  which  Proposition  1  will  be  valid  in  the 
negative  case  is  that  the  number  of  stages  in  our  problem  be  bounded.  That  is, 
suppose  that  there  exists  a  stopped  state  having  the  property  that  once  the 
process  enters  that  state  it  can  never  leave  it  and  all  costs  incurred  while  in 
that  state  are  0.  Then  it  follows  from  the  proof  of  Proposition  1  that  if  the 
time  until  the  process  first  enters  the  stopped  state  is,  with  probability  1, 
bounded, for each initial state and for each policy, then the proposition is valid.

A second sufficient condition requires the notion of a stationary policy.
We say that a policy is stationary if the action it chooses at any time is a
deterministic function of the state of the process at that time. If f is a
stationary policy, we define f(i) to be the action f chooses when the process
is in state i. We are now ready for


Proposition 3:

Assume that $C(i,a) \ge 0$. If the state and action spaces are both finite,
and if f is a stationary policy such that

(3)   $V_f(i) < C(i,a) + \sum_j P_{ij}(a) V_f(j)$ for all $a \ne f(i)$ and all i,

where $V_f(i)$ is the total expected cost incurred under f, then f is optimal.


Proof:

We introduce a discount factor $\alpha$, $0 < \alpha < 1$, and define
$V_f^{\alpha}(i)$ to be the total expected discounted cost incurred under policy f.
Since $C(i,a) \ge 0$, it follows from Lebesgue's monotone convergence theorem that
$\lim_{\alpha \to 1} V_f^{\alpha}(i) = V_f(i)$. Hence, since the state and action
spaces are both finite it follows from (3) that

$V_f^{\alpha}(i) < C(i,a) + \alpha \sum_j P_{ij}(a) V_f^{\alpha}(j)$ for all $a \ne f(i)$ and all i,

for all $\alpha$ sufficiently near 1. But, as is well known, if the expected discounted
cost from a policy satisfies the discounted optimality equation then that policy
is discount optimal. Hence f is $\alpha$-discount optimal for all $\alpha$ near 1.
Therefore, for any policy g, $V_f^{\alpha}(i) \le V_g^{\alpha}(i)$ for $\alpha$
near 1 and the result follows by letting $\alpha \to 1$.

Q.E.D.


A  Technical  Remark 

In the above formulation we have assumed a countable state space and a
finite action space. In the general setting of arbitrary state and action spaces
these results are more difficult to prove. The main difficulty lies in
establishing the optimality equation

$V(x) = \sup_a \Big[ R(x,a) + \int V(y)\, dP(y \mid x,a) \Big]$.


The reason for this difficulty is that it is not easy, in general, to prove that
V(x) is a measurable function. However, assuming that R(x,a) is a measurable
function and that the transition probability $P(\cdot \mid x,a)$ is a regular
conditional probability measure (this implies that $g(x) = \int h(y)\, dP(y \mid x,a)$
is a measurable function whenever h(x) is), then we can easily establish the
measurability of V(x) in the positive case as long as the action space is
countable. This is shown by defining

$V_1(x) = \sup_a R(x,a)$

$V_{n+1}(x) = \sup_a \Big[ R(x,a) + \int V_n(y)\, dP(y \mid x,a) \Big]$.


As the supremum of a countable number of measurable functions is itself measurable,
it follows by induction that $V_n(x)$ is a measurable function. Also, as $V_n(x)$
is the optimal expected return function for an n-stage problem it follows, in
the positive case, that $\lim_n V_n(x) = V(x)$. Indeed, for any policy f, since the
n-stage return under f is at most $V_n$, we have that $V_f(x) \le \lim_n V_n(x)$.
On the other hand, since rewards are nonnegative we may, for any $\varepsilon$,
define a policy $f_n$ such that $V_{f_n}(x) \ge V_n(x) - \varepsilon$. Hence,
$V(x) \le \lim_n V_n(x) \le \lim_n V_{f_n}(x) + \varepsilon \le V(x) + \varepsilon$.
The measurability of V now follows from the fact that the limit of a countable
number of measurable functions is itself measurable.

In the negative case it is not necessarily true that $\lim_n V_n(x)$ is equal to
V(x). However, if in addition to assuming that the action space is countable
we also assume that the one-stage costs are bounded then we can again easily
establish the measurability of V(x). This is done by introducing a discount
factor $\alpha$, $0 < \alpha < 1$. Define $V_\alpha(x)$ to be the optimal discounted
cost function. By the same argument as given in the positive case it follows that
$V_\alpha(x)$ is measurable. (Since costs are bounded the discount factor assures
us that the optimal n-stage discounted cost function will converge to the optimal
infinite stage discounted cost function.) The measurability of V(x) now follows
since, by Lebesgue's monotone convergence theorem,
$V(x) = \lim_{n \to \infty} V_{\alpha_n}(x)$, where $\alpha_n$ is a sequence of
discount factors increasing to 1.

As long as the action space is countable and V(x) is measurable, then
Propositions 1 and 2 go through exactly as before. For the analysis in the most
general cases the interested reader should consult Blackwell [1], [2], and Strauch
[12].
3. THE RED-BLACK GAMBLING MODEL


The red-black gambling model is concerned with the following situation. An
individual enters a gambling casino that allows any bet of the following form:
If you have a fortune of i units then you are allowed to bet any positive
integral amount less than or equal to i. Furthermore if you bet j then
you either win j with probability p or lose j with probability q = 1 - p.

The first problem we shall consider is that of maximizing the probability
that an individual will attain a fortune of N before going broke. This problem
fits the framework of positive dynamic programming since if we suppose that a
terminal reward of 1 is earned if we ever reach state N and all other rewards
are zero, then the expected total reward equals the probability of ever reaching
state N. In order to determine an optimal policy we first note that if our
present fortune is i then it would never pay to bet more than N - i. Hence,
from Proposition 1 it follows that a policy f will be optimal if and only if
its return satisfies

(4)   $V_f(i) \ge p V_f(i+k) + q V_f(i-k)$,  0 < i < N,  $k \le \min(i, N-i)$.


Define the timid strategy to be that strategy which always bets 1. Under
this strategy the game becomes the classic gambler's ruin model and U(i), the
probability of reaching N before going broke when you start with i, 0 < i < N,
is given by

(5)   $U(i) = \dfrac{1 - (q/p)^i}{1 - (q/p)^N}$ if $p \ne 1/2$,   $U(i) = i/N$ if $p = 1/2$.
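
As a numerical cross-check of (5), the sketch below compares the closed form with the solution of the gambler's ruin recursion u(i) = p u(i+1) + q u(i-1), u(0) = 0, u(N) = 1, obtained by repeated substitution. The function names and the number of sweeps are illustrative assumptions; any method of solving the linear recursion would serve equally well.

```python
def ruin_closed_form(i, N, p):
    """Formula (5): probability of reaching N before 0 from i, betting 1 each time."""
    if p == 0.5:
        return i / N
    ratio = (1 - p) / p
    return (1 - ratio**i) / (1 - ratio**N)


def ruin_by_iteration(N, p, sweeps=20000):
    """Solve u(i) = p*u(i+1) + q*u(i-1), u(0) = 0, u(N) = 1 by repeated substitution."""
    q = 1 - p
    u = [0.0] * N + [1.0]
    for _ in range(sweeps):
        # Every interior value is recomputed from the previous sweep.
        u = [0.0] + [p * u[i + 1] + q * u[i - 1] for i in range(1, N)] + [1.0]
    return u


if __name__ == "__main__":
    N = 10
    for p in (0.3, 0.5, 0.6):
        u = ruin_by_iteration(N, p)
        gap = max(abs(u[i] - ruin_closed_form(i, N, p)) for i in range(N + 1))
        print(f"p = {p}: largest gap between iteration and (5) is {gap:.2e}")
```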


Theorem 1:

If $p \ge 1/2$ the timid strategy maximizes the probability of ever attaining
a fortune of N.


Proof:

If p = 1/2 then U(i) = i/N trivially satisfies (4). When p > 1/2
we must show that

$\dfrac{1 - (q/p)^i}{1 - (q/p)^N} \ge p\, \dfrac{1 - (q/p)^{i+k}}{1 - (q/p)^N} + q\, \dfrac{1 - (q/p)^{i-k}}{1 - (q/p)^N}$

or equivalently that

$(q/p)^i \le p (q/p)^{i+k} + q (q/p)^{i-k}$

or

$1 \le p \big[ (q/p)^k + (p/q)^{k-1} \big]$.

Note that the above holds for k = 1 and the result will be proven if we can
show that $(q/p)^x + (p/q)^{x-1}$ is an increasing function of x for $x \ge 1$ when
p > 1/2. This however follows immediately upon differentiation.

Q.E.D.
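
The differentiation step left to the reader in the proof can be written out as follows. The symbol $g$ is introduced here only for this illustration; the computation assumes p > 1/2, so that p/q > 1.

```latex
\begin{align*}
g(x)  &= (q/p)^{x} + (p/q)^{x-1}, \\
g'(x) &= (q/p)^{x}\ln(q/p) + (p/q)^{x-1}\ln(p/q)
       = \ln(p/q)\,\bigl[(p/q)^{x-1} - (q/p)^{x}\bigr].
\end{align*}
% For p > 1/2 and x >= 1: ln(p/q) > 0 and (p/q)^{x-1} >= 1 > (q/p)^{x},
% hence g'(x) > 0. At k = 1 the bound is tight:
\[
p\,g(1) = p\,\bigl[(q/p) + 1\bigr] = q + p = 1 .
\]
```

Since $p\,g(1) = 1$ and $g$ is increasing on $x \ge 1$, the inequality $1 \le p\,g(k)$ holds for every integer $k \ge 1$.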


Theorem  1  seems  to  be  one  of  those  results  that  are  well-known  but  never 
seem  to  have  been  specifically  proven  in  the  literature.  Of  course,  timid  play 
was  known  to  be  optimal  among  the  class  of  strategies  that  always  bet  a  fixed 
amount  at  each  stage. 


Define the bold strategy to be the strategy which, if our present fortune
is i,

    bets i if $i \le N/2$
    bets N - i if $i > N/2$.


In [6] Dubins and Savage have shown that the bold strategy maximizes the
probability of ever attaining a fortune of N when $p \le 1/2$. Their approach
was similar in that they proved this result by showing that the return from the
bold strategy satisfies the optimality equation (4) whenever $p \le 1/2$.
However, as opposed to the timid strategy case it is not possible to easily
obtain an exact expression for the return from the bold strategy and Dubins
and Savage had to resort to a quite ingenious proof to establish (4).

Thus when $p \le 1/2$ bold play is optimal while if $p \ge 1/2$ then it is
timid play that is optimal. (When p = 1/2 it follows from martingale theory
that any strategy that never bets to strictly exceed N is optimal.) Also,
by regarding your losses as the winnings of your opponent it follows that if
$p \le 1/2$ then your worst possible strategy is the timid strategy and if
$p \ge 1/2$ then your worst possible strategy is the bold one (assuming, of
course, that you would never consider betting more than N - i when your present
fortune is i).
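
The dichotomy between bold and timid play can be illustrated numerically. The sketch below runs value iteration on the right-hand side of (4), starting from zero in the interior states, and compares the result with the return of each strategy computed by the same kind of iteration. The function names, the choice N = 8, and the fixed number of sweeps are assumptions made for this illustration and are not taken from the original text.

```python
def bold_bet(i, N):
    """Bold play: stake everything, or just enough to reach N."""
    return min(i, N - i)


def timid_bet(i, N):
    """Timid play: always stake one unit."""
    return 1


def policy_value(bet, N, p, sweeps=2000):
    """Probability of reaching N before 0 under the stationary rule `bet`."""
    q = 1 - p
    v = [0.0] * N + [1.0]
    for _ in range(sweeps):
        new = v[:]
        for i in range(1, N):
            k = bet(i, N)
            new[i] = p * v[i + k] + q * v[i - k]
        v = new
    return v


def optimal_value(N, p, sweeps=2000):
    """Value iteration: maximize over all bets k <= min(i, N - i) in each sweep."""
    q = 1 - p
    v = [0.0] * N + [1.0]
    for _ in range(sweeps):
        new = v[:]
        for i in range(1, N):
            new[i] = max(
                p * v[i + k] + q * v[i - k] for k in range(1, min(i, N - i) + 1)
            )
        v = new
    return v


def largest_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    N = 8
    for p in (0.4, 0.6):
        best = optimal_value(N, p)
        print(
            f"p = {p}: gap to bold = {largest_gap(best, policy_value(bold_bet, N, p)):.2e}, "
            f"gap to timid = {largest_gap(best, policy_value(timid_bet, N, p)):.2e}"
        )
```

Value iteration from zero is a natural choice here because, in the positive case, the n-stage optimal returns increase to the optimal return.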

Suppose  now  that  our  objective  is  not  to  reach  some  preassigned  goal  but 
rather is to maximize our playing time. We now show that if $p \ge 1/2$ then
timid play stochastically maximizes our playing time. That is, for each n,
the probability that we will be able to play n or more times before going
broke is maximized by the timid strategy.

Theorem 2:

If $p \ge 1/2$ then timid play stochastically maximizes our playing time.

Proof:

By assuming that a reward of 1 is attained if we are able to play at least
n times, we see that this problem also fits the framework of the positive case.
Hence we must show that starting with i, it is better to play timidly than it
is to make an initial bet of k, $k \le i$, and then play timidly. However this
follows since, by Theorem 1, the timid strategy maximizes the probability that
we will get to i + k before i - k and it takes at least one unit of time.

More formally, letting $U_n(i)$ denote the probability that we will
be able to play at least n times given that our initial fortune is i and
we play timidly, we obtain, by conditioning on the time T that our fortune
reaches either i - k or i + k and on the value X that is reached,

$U_n(i) = E[U_{n-T}(X)]$

$\ge E[U_{n-1}(X)]$

$= U_{n-1}(i+k)\, P\{X = i+k\} + U_{n-1}(i-k)\, P\{X = i-k\}$

$\ge p\, U_{n-1}(i+k) + q\, U_{n-1}(i-k)$.

The first inequality follows from the fact that $U_n(i)$ is a decreasing function
of n and $T \ge 1$, while the second inequality follows since $P\{X = i+k\} \ge p$
by Theorem 1 and $U_{n-1}(i+k) \ge U_{n-1}(i-k)$. Q.E.D.

In [9] Molenaar and Van Der Velde considered a gambling casino that
accepted any (k, c) bet where k and c are integers. A (k, c) bet would
win k with probability $\frac{c}{k+c}$ and would lose c with probability
$\frac{k}{k+c}$. Note that these are all fair bets in the sense that the
expected gain is zero. They proved, by a concavity argument, that the timid
strategy (always bet (1,1)) stochastically maximizes the bettor's playing time
before going broke. This result, however, also easily follows by our approach
since playing timidly is better than making any initial (k, c) bet and then
following this initial bet with timid play. This is true since under timid play
we would also reach i + k before i - c with probability $\frac{c}{k+c}$ and the
amount of time until reaching either value is at least one. (Here, of course,
i is the bettor's initial fortune.)

It turns out that if p < 1/2 then the timid strategy does not stochastically
maximize our playing time. For suppose p = .1 and we start with an initial
fortune of 2. The probability that we will be able to play at least 5 games
if we play timidly is $1 - (.9)^2 - 2(.9)^3(.1) = .0442$. On the other hand if
we bet 2 initially and then play timidly then the probability of playing at
least 5 games is .1. It is however true that timid play maximizes our
expected playing time.
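
The two probabilities in this example can be reproduced with a short recursion on the number of bets still to be made. The sketch below is illustrative: the function name and its `first_bet` parameter are assumptions introduced here, and "playing at least n games" is read as being able to place n bets.

```python
from functools import lru_cache


def prob_of_playing_at_least(n, fortune, p, first_bet=1):
    """P(at least n bets can be placed) when the first stake is `first_bet`
    (assumed to satisfy 1 <= first_bet <= fortune) and every later stake is 1."""
    q = 1 - p

    @lru_cache(maxsize=None)
    def can_play(m, i):
        # Probability of placing m further unit bets from fortune i.
        if m == 0:
            return 1.0
        if i == 0:
            return 0.0
        return p * can_play(m - 1, i + 1) + q * can_play(m - 1, i - 1)

    if n == 0:
        return 1.0
    if fortune == 0:
        return 0.0
    return p * can_play(n - 1, fortune + first_bet) + q * can_play(
        n - 1, fortune - first_bet
    )


if __name__ == "__main__":
    timid = prob_of_playing_at_least(5, 2, 0.1)
    bold_first = prob_of_playing_at_least(5, 2, 0.1, first_bet=2)
    print(f"timid throughout: {timid:.4f}")
    print(f"bet 2 first, then timid: {bold_first:.4f}")
```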

Theorem  3: 

If p < 1/2 then timid play maximizes our expected playing time.

Proof:

Let U(i) denote the expected number of bets made before we go broke
given that we start with i and always bet 1. To calculate U(i), let
$X_j$ denote your winnings on the jth bet and let T denote the number of
bets you make before going broke. Then, since $\sum_{j=1}^{T} X_j = -i$ we have
by Wald's equation that

$-i = EX \cdot ET$

or, since EX = 2p - 1,

$U(i) = ET = \dfrac{i}{1 - 2p}$.

Since maximizing our expected playing time falls under the positive case (we
receive a reward of 1 each time that we are able to continue playing) the
result follows since

$U(i) \ge 1 + pU(i+k) + qU(i-k)$,  $1 \le k \le i$

follows by direct verification. Q.E.D.
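
The direct verification can be spelled out in one line. It takes $U(i) = i/(q-p)$, which is what Wald's equation gives for unit bets with $EX = p - q$; the display below is a worked check added for illustration.

```latex
\begin{align*}
1 + p\,U(i+k) + q\,U(i-k)
  &= 1 + \frac{p(i+k) + q(i-k)}{q-p}
   = 1 + \frac{i - k(q-p)}{q-p} \\
  &= U(i) + 1 - k \;\le\; U(i) \qquad \text{for } k \ge 1 .
\end{align*}
```

Equality holds exactly when k = 1, that is, for the timid bet itself.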

Theorem 3 remains true in more general gambling models. Consider a gambling
casino that allows you to make any bet such that, when your present fortune is
i, the outcome of the bet is an integer valued random variable X satisfying

(i)  $X \ge -i$ with probability 1

(ii)  $|X| \ge 1$ with probability 1

(iii)  $EX \le a - 1$

where a is some fixed positive number less than 1. It follows now in the
same manner as in Theorem 3 that the timid strategy, which always bets to
either win or lose 1 with respective probabilities a/2 and 1 - a/2, maximizes
your expected playing time. This is true since $U(i) = \dfrac{i}{1-a}$ is easily
shown to satisfy

$U(i) \ge 1 + E\,U(i + X)$

whenever X satisfies (i), (ii), and (iii).

The above also shows that playing in an unfair game with a minimum bet
will eventually break you and in a finite expected time (compare with Breiman
[4] p. 101).


4.  A  GAMBLING-WORK  MODEL 

In  this  section  we  consider  the  following  variation  of  the  red-black 
gambling model. We again suppose that a bettor whose fortune is i may bet
any amount j, $j \le i$, and win or lose j with respective probabilities p and
q = 1 - p. However, we now suppose that the bettor need not place any bet at all
but  rather  may  elect  to  work.  If  he  decides  to  work  then  he  works  for  c  units 
of  time  and  earns  1  dollar.  Assuming  that  each  gamble  takes  1  unit  of  time  the 
problem  is  to  determine  a  strategy  that  minimizes  the  expected  time  until  our 
worker-gambler  attains  a  fortune  of  N  dollars.  (The  worker-gambler  must  work 
when  he  is  broke.) 

Theorem 4:

If $p \le 1/2$ then the strategy of always working minimizes the expected time
until a fortune of N is attained.

Proof:

The expected time to reach N if we start from i and always work is
given by U(i) = (N - i)c. Since the above is clearly a problem of minimizing
nonnegative costs we shall apply Proposition 3. Hence, we need show that

$U(i) < 1 + E\,U(i + X)$

or equivalently

$(N - i)c < 1 + c\,E(N - i - X)$

or

$0 < 1 - c\,EX$

which follows since EX, the expected gain of a bet, is nonpositive. In fact, since
$EX \le 2p - 1$, the above shows that always working is optimal for all values of
p such that $p < \dfrac{c+1}{2c}$.

Q.E.D.
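
For $p \le 1/2$ the theorem can be illustrated by value iteration on the minimum expected time. The sketch below is a hedged illustration rather than the method of the proof: the function name, the parameter values, and the restriction of bets to j ≤ min(i, N - i) (so that the fortune never overshoots N) are assumptions made here.

```python
def min_expected_time(N, c, p, sweeps=2000):
    """Value iteration for the work-or-gamble model.

    Working costs c time units and raises the fortune by 1; a bet of j costs
    1 time unit and moves the fortune to i + j (prob. p) or i - j (prob. 1 - p).
    Bets are restricted to j <= min(i, N - i), an assumption of this sketch.
    """
    q = 1 - p
    v = [0.0] * (N + 1)  # v[N] stays 0: the goal has been reached
    for _ in range(sweeps):
        new = [0.0] * (N + 1)
        for i in range(N):
            best = c + v[i + 1]  # work for c units of time
            for j in range(1, min(i, N - i) + 1):
                best = min(best, 1 + p * v[i + j] + q * v[i - j])
            new[i] = best
        v = new
    return v


if __name__ == "__main__":
    N, c, p = 6, 2.0, 0.4
    v = min_expected_time(N, c, p)
    for i in range(N + 1):
        print(f"fortune {i}: computed {v[i]:.6f}, always working {(N - i) * c:.6f}")
```

In state 0 the inner loop is empty, which matches the requirement that the worker-gambler must work when he is broke.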


Thus for $p \le 1/2$ the optimal strategy is always to work. However, let us
now consider this same problem but under the assumption that the gambler will only
work when broke and thus will gamble at all other times. Under this condition we
shall show that it is the bold strategy that is optimal.

Let us suppose that each time the gambler reaches a fortune of N he gives
that amount away and then starts to play again. That is, after reaching N he
then works for c units of time and then starts gambling again with a fortune of
1 dollar. Let us say that a cycle is completed each time the gambler's fortune
reaches 0 or N. Letting $X_i$ denote the time of the ith cycle and letting T
denote the number of cycles that it takes our gambler to reach a fortune of N
we have that the expected time until the gambler reaches N is given by
$E\big[\sum_{i=1}^{T} X_i\big]$. Now if the gambler is initially broke and he
employs a stationary strategy then the random variables $X_1, X_2, \ldots$ are
independent and identically distributed. Hence, by Wald's equation the expected
time until the gambler first reaches N is given by EX ET. However, T is a
geometric random variable with mean $1/\alpha$ where $\alpha$ is the probability
that starting with 1 the gambler will reach N before 0. Hence, by the
Dubins-Savage result it follows, since $p \le 1/2$, that ET is minimized by the
bold strategy. Therefore, as we know by Proposition 2 that an optimal stationary
strategy exists, we can show that the bold strategy is optimal if we can show
that it minimizes EX. That is, we need to show in the original red-black model
that the bold strategy minimizes the expected time until the gambler reaches a
fortune of either 0 or N. In fact, we shall establish this by proving the
stronger result that the bold strategy minimizes E[min(X, n)] for all n. That
is, if the bettor is allowed to play at most n stages and if he stops before
this if he ever reaches 0 or N then the strategy minimizing his playing time is
the bold one. The reason for considering this modified problem is that it is a
problem with a bounded number of stages and thus we would only need show that
the expected playing time under the bold strategy satisfies the optimality
equation.

Following Dubins and Savage (Chapter 5 of [6]) we first generalize the model
as  follows:  We  suppose  that  the  initial  fortune  may  be  any  number  in  (0,1) 
and  that  the  bettor  stops  playing  either  when  his  fortune  reaches  0  or  1  or  when 
he  has  already  played  n  times.  We  shall  refer  to  this  as  the  n-stage  red-black 
model. 

Define the bold strategy to be the strategy which, if the bettor's present
fortune is r and he is allowed to bet,

    bets r if $r \le 1/2$
    bets 1 - r if $r > 1/2$.

Let $U_n(r)$ denote the bettor's expected playing time in the n-stage red-black
model if the initial fortune is r and the bold strategy is employed. By
conditioning upon the outcome of the first play we obtain

(7)   $U_n(r) = 1 + p\,U_{n-1}(2r)$ for $0 < r \le 1/2$,
      $U_n(r) = 1 + q\,U_{n-1}(2r - 1)$ for $1/2 < r < 1$,

$U_n(1) = U_n(0) = 0$,  $U_0(r) = 0$,  0 < r < 1.
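
Recursion (7) is easy to evaluate exactly on dyadic fortunes. The sketch below uses rational arithmetic and also tests the one-step comparison $U_n(r) \le 1 + pU_{n-1}(r+s) + qU_{n-1}(r-s)$ on a dyadic grid of positive bets s with s ≤ r and s ≤ 1 - r. The helper names `make_bold_time` and `bold_is_one_step_best` and the grid resolution are assumptions of this illustration.

```python
from fractions import Fraction
from functools import lru_cache


def make_bold_time(p):
    """Return U(n, r): expected playing time under bold play, from recursion (7)."""
    q = 1 - p

    @lru_cache(maxsize=None)
    def U(n, r):
        if n == 0 or r <= 0 or r >= 1:
            return Fraction(0)
        if r <= Fraction(1, 2):
            return 1 + p * U(n - 1, 2 * r)  # stake r: win doubles the fortune
        return 1 + q * U(n - 1, 2 * r - 1)  # stake 1 - r: loss leaves 2r - 1

    return U


def bold_is_one_step_best(U, p, n, order):
    """Check U(n, r) <= 1 + p*U(n-1, r+s) + q*U(n-1, r-s) on the grid i / 2**order."""
    q = 1 - p
    d = 2**order
    for i in range(1, d):
        r = Fraction(i, d)
        for j in range(1, min(i, d - i) + 1):
            s = Fraction(j, d)
            if U(n, r) > 1 + p * U(n - 1, r + s) + q * U(n - 1, r - s):
                return False
    return True


if __name__ == "__main__":
    p = Fraction(1, 3)
    U = make_bold_time(p)
    for r in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
        print(f"U_2({r}) = {U(2, r)}")
    print("comparison holds for n = 2 on eighths:", bold_is_one_step_best(U, p, 2, 3))
```

Exact fractions are used so that the comparison is not blurred by rounding error when the two sides are equal, as they are for the bold bet itself.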
n  n  o 

Theorem  5: 

In  the  n-stage  red-black  model,  among  those  strategies  which,  when  the  bettor's 
fortune  is  r  ,  never  bet  an  amount  greater  than  1  -  r  ,  the  bold  strategy  minimizes 
the  bettor's  expected  playing  time. 




Proof: 

Assume first that $p \ge 1/2$. It suffices to prove that

$$U_n(r) \le 1 + p\,U_{n-1}(r + s) + q\,U_{n-1}(r - s) \tag{8}$$

for all $0 \le s \le r$, $s \le 1 - r$. A number of the form $i/2^k$, where i and k
are nonnegative integers such that $i \le 2^k$, will be said to be of order at most k.
For example, 0 and 1 are of order at most 0; 0, 1/2, 1 are of order at most 1, etc.

We  first  show,  by  induction,  that  (8)  holds  for  all  n  whenever  r  and  s  are 
of  order  at  most  k  . 

Since $U_n(0) = U_n(1) = 0$ it follows that (8) holds for all n whenever
r and s are both of order 0. So assume that (8) holds for all r and s of
order at most k and suppose that r and s are of order at most k + 1. We
first note that if x is of order at most k + 1 then

    2x is of order at most k when $x \le 1/2$
    2x - 1 is of order at most k when $x > 1/2$.

There  are  four  cases  we  need  consider. 

Case 1: $r + s \le 1/2$, $s \le r$

In this case we have by (7) that

$$\begin{aligned}
U_n(r) &- p\,U_{n-1}(r+s) - q\,U_{n-1}(r-s) - 1 \\
&= 1 + p\,U_{n-1}(2r) - p - p^2\,U_{n-2}(2r+2s) - q - qp\,U_{n-2}(2r-2s) - 1 \\
&= p\,[U_{n-1}(2r) - p\,U_{n-2}(2r+2s) - q\,U_{n-2}(2r-2s)] - 1.
\end{aligned}$$

But 2r and 2s are both of order at most k, so by the induction hypothesis the
bracketed term is at most 1. The above is therefore at most p - 1, which is
nonpositive.


Case 2: $r - s > 1/2$

The proof is just as in Case 1, except that the second functional equation
of (7) is used instead of the first.


Case 3: $r \le 1/2 \le r + s$, $s \le r$

From (7) we have that

$$\begin{aligned}
U_n(r) &- p\,U_{n-1}(r+s) - q\,U_{n-1}(r-s) - 1 \\
&= 1 + p\,U_{n-1}(2r) - p - pq\,U_{n-2}(2r+2s-1) - q - qp\,U_{n-2}(2r-2s) - 1.
\end{aligned} \tag{9}$$

Now, since $r \ge s$, it follows that $2r \ge r + s \ge 1/2$ and thus

$$U_{n-1}(2r) = 1 + q\,U_{n-2}(4r - 1). \tag{10}$$

Also, since $2r - 1/2 \le 1/2$ we also have that

$$U_{n-1}(2r - 1/2) = 1 + p\,U_{n-2}(4r - 1). \tag{11}$$

Thus from (10) and (11) we obtain that

$$p\,U_{n-1}(2r) = p + q\,[U_{n-1}(2r - 1/2) - 1].$$

Inserting this into (9) yields that (9) is equal to

$$\begin{aligned}
p &+ q\,U_{n-1}(2r - 1/2) - q - pq\,U_{n-2}(2r+2s-1) - qp\,U_{n-2}(2r-2s) - 1 \\
&= q\,[U_{n-1}(2r - 1/2) - p\,U_{n-2}(2r+2s-1) - p\,U_{n-2}(2r-2s) - 1] + p - 1.
\end{aligned} \tag{12}$$

Now, if $s \ge 1/4$ then, since $p \ge q$, we have that (12) is less than or equal to

$$q\,[U_{n-1}(2r - 1/2) - p\,U_{n-2}(2r+2s-1) - q\,U_{n-2}(2r-2s) - 1] + p - 1,$$

which is nonpositive by the induction hypothesis since $2r - 1/2$ and
$2s - 1/2$ are both of order at most k. On the other hand, if $s < 1/4$ then since
$p \ge q$ we have that (12) is less than or equal to

$$q\,[U_{n-1}(2r - 1/2) - p\,U_{n-2}(2r-2s) - q\,U_{n-2}(2r+2s-1) - 1] + p - 1,$$

which is nonpositive by the induction hypothesis since $2r - 1/2$ and $1/2 - 2s$ are
both of order at most k.

Case 4: $r - s \le 1/2 \le r$

The proof for this case is similar to that of the preceding ones and the
induction is completed.

Thus, when $p \ge 1/2$, we have proven (8) whenever r and s are binary
rationals (that is, numbers of the form $i/2^k$). By considering a second player
whose initial fortune is 1 - r, and by viewing the winnings of the first
player as the losses of the second one and vice-versa, it follows since both
players play the same amount of time that this result is also true when $p \le 1/2$.

It thus remains to establish (8) when r and s are not binaries. We
will do this by showing that $U_n(r)$ is continuous at r whenever r is not a
binary rational. Then by letting $\{r_j, j \ge 1\}$ and $\{s_j, j \ge 1\}$ be sequences of
binaries such that $r_j \to r$ and $s_j \to s$ it follows by continuity that (8) holds
for all $0 \le s \le r$, $s \le 1 - r$. The following lemma thus completes the proof.


Lemma 6:

$U_n(r)$ is continuous at r whenever r is not a binary rational.

Proof:

Let B denote the set of binary rationals, and define

$$s^- = \sup_{\{r_j \to r^-\},\; r \notin B,\; n} \ \limsup_{j} |U_n(r_j) - U_n(r)|$$

$$s^+ = \sup_{\{r_j \to r^+\},\; r \notin B,\; n} \ \limsup_{j} |U_n(r_j) - U_n(r)|.$$

That is, $s^-$ and $s^+$ measure the worst possible discontinuity of any of the
functions $U_n(r)$ when r is not binary. Note that since there is a positive
probability of at least min(p, 1 - p) that play will end at each stage when the
bold strategy is employed, it follows that $U_n(r) \le 1/\min(p, 1 - p)$. Hence $s^-$ and
$s^+$ are both finite. Note also that $U_n(r)$ is continuous at all $r \notin B$ if
and only if $s^- = s^+ = 0$. We show this by contradiction. Assume, for instance,
that $s^- > 0$. Let n, $\{r_j\}$, r be such that

$$r_j \to r^-, \quad r \notin B, \quad \limsup_{j} |U_n(r_j) - U_n(r)| > s^- \max(p, q).$$


There  are  two  cases. 


Case 1: $r < 1/2$

In this case

$$U_n(r_j) = 1 + p\,U_{n-1}(2r_j)$$

$$U_n(r) = 1 + p\,U_{n-1}(2r).$$

Therefore,

$$\limsup_{j} |U_{n-1}(2r_j) - U_{n-1}(2r)| = \limsup_{j} |U_n(r_j) - U_n(r)|/p > s^- \max(p, q)/p \ge s^-,$$

which is a contradiction since $2r_j \to (2r)^-$ and $2r \notin B$.


Case 2: $r > 1/2$

In this case for all j sufficiently large (so that $r_j > 1/2$)

$$U_n(r_j) = 1 + q\,U_{n-1}(2r_j - 1)$$

$$U_n(r) = 1 + q\,U_{n-1}(2r - 1),$$

implying that

$$\limsup_{j} |U_{n-1}(2r_j - 1) - U_{n-1}(2r - 1)| = \limsup_{j} |U_n(r_j) - U_n(r)|/q > s^- \max(p, q)/q \ge s^-,$$

which is a contradiction since $2r_j - 1 \to (2r - 1)^-$ and $2r - 1 \notin B$.

Hence $s^-$ must be 0. A similar argument holds for $s^+$ and hence the
lemma, and thus Theorem 5, are proven.

Q.E.D. 

Applying  Theorem  5  to  the  red-black  gambling-work  model  yields  the 
following. 

Theorem  6: 

In  the  Red-Black  Gambling-Work  model  if  the  gambler  will  only  work  when 
his  fortune  is  zero  then  the  bold  strategy  minimizes  the  expected  time  until  he 
reaches  his  goal. 

Proof: 

This theorem has already been proven (see the remarks following Theorem 4)
when the gambler's initial fortune is zero. For an arbitrary initial fortune
the proof is exactly as before; we let $X_i$ denote the length of the ith cycle
and T the number of required cycles. Then $X_1, X_2, \ldots$ are independent
(though $X_1$ has a different distribution from the others). From Wald's equation
the expected time until the gambler reaches N is

$$E \sum_{i=1}^{T} X_i = E \sum_{i=1}^{T} E X_i.$$

By the Dubins-Savage result, ET is minimized by the bold strategy and by Theorem 5
$EX_i$ is minimized, for all i, also by the bold strategy.
We have not obtained any general results for the work-gambling model when
p > 1/2. There are, however, reasonable conjectures that may or may not be
true. For instance if we are always free to either work or gamble then it seems
reasonable that the optimal strategy could be chosen so as to have the following
3-region structure:

    Optimal Strategy:   Work   |   Gamble   |   Work
    Fortune:            0 ------------------------ N

That is, we should work when our fortune is either near 0 or near N and gamble
otherwise (of course, any of these regions may be vacuous). For example, suppose
that c, the unit of work time, is 1. Then it is obvious that it is optimal to
work when our fortune is quite small (for instance 1) or quite large (for instance
N - 1), and the conjecture is that you should gamble with an in-between fortune.
Of course, the amount gambled remains to be determined.

When the gambler is only permitted to work when broke, the problem is to
determine how much he is to gamble at each fortune. In [3] Breiman has shown
that in the pure gambling red-black model if one is allowed to bet any fraction
of his fortune (and not just integral amounts) then the strategy that asymptotically
minimizes his expected time to reach some preassigned goal is the Kelly strategy
which always bets the fixed fraction p - q of your fortune. Of course Breiman's
model does not allow you to work when broke and thus rules out any such strategy
as the bold one (which would have an infinite expected time). Nevertheless,
assuming that we change the model so as to allow us to bet any fraction of our
fortune then it may turn out that the Kelly strategy would remain, in some sense,
asymptotically optimal.
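The fraction p - q can be illustrated numerically. The sketch below does not reproduce Breiman's asymptotic result on expected time; it only checks the standard property that betting the fraction f = p - q maximizes the expected logarithmic growth per play, $p \log(1+f) + q \log(1-f)$. The function names and the grid resolution are hypothetical.

```python
import math

def log_growth(f, p):
    """Expected log-growth per play when betting fraction f of the current fortune."""
    q = 1.0 - p
    return p * math.log(1.0 + f) + q * math.log(1.0 - f)

def best_fraction_on_grid(p, steps=100):
    """Grid search over f = k/steps for k = 0, ..., steps - 1 (f = 1 is excluded)."""
    grid = [k / steps for k in range(steps)]
    return max(grid, key=lambda f: log_growth(f, p))

p = 0.6  # illustrative win probability, assumed greater than 1/2
print(best_fraction_on_grid(p), p - (1 - p))
```

With p = 0.6 the grid maximizer coincides with p - q = 0.2 up to floating-point rounding.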
5. Some Counterexamples In Dynamic Programming

In this section, we return to the general dynamic programming model consisting
of a countable state space and a finite action space, and in which the objective
is to maximize nonnegative rewards (i.e., the positive case). In most instances
when a specific model (such as the gambling models of Sections 3 and 4) is
analyzed within this framework, it turns out that we need only consider stationary
policies. However, this is not always the case (though Blackwell [2] did show that
if an optimal policy exists then there is a stationary optimal policy) and it is
worthwhile to try to determine some subclass of policies such that for any arbitrary
policy there necessarily exists a policy within this subclass which performs at
least as well. We now define certain subclasses of policies.

A policy is said to be

(1) stationary, if the action it chooses at any time is a deterministic
function of the state at that time.

(2) randomized stationary, if its action at any time is a randomized
function of the state at that time.

(3) Markov or memoryless, if its action at time t is a deterministic
function of the state at time t and of t.

(4) randomized Markov or randomized memoryless, if its action at time t
is a randomized function of the state at time t and of t.

It follows from results presented by Derman and Strauch [5] that if the
initial state is fixed then we need never go outside the class of randomized
memoryless policies. That is, for any policy f and initial state i, there
exists a randomized memoryless policy f' such that

$$V_{f'}(i) \ge V_f(i).$$
They prove this by showing that the class of randomized memoryless policies is
large enough so that, for any policy f, there exists a randomized memoryless
policy f' such that

$$P_f\{X_t = j, a_t = a \mid X_0 = i\} = P_{f'}\{X_t = j, a_t = a \mid X_0 = i\},$$

which, of course, implies that $V_f(i) = V_{f'}(i)$.

We  now  show,  by  counterexample,  that  we  cannot  generally  restrict  attention 
to  either  the  class  of  randomized  stationary  policies  or  to  the  class  of 
memoryless  policies. 

The  first  counterexample  shows  that  we  cannot  always  restrict  attention  to 
the  memoryless  policies. 


Example  1: 


Let the states be given by 0, 1, 1', 2, 2', .... State 0 is an absorbing
state and once entered can never be left, i.e.,

$$P_{0,0} = 1.$$

In state n, n > 0, there are 2 possible actions having respective transition
probabilities

$$P_{n,n+1}(1) = P_{n,n'}(2) = 1, \quad n > 0.$$

In state n', n > 0, there is a single available action, having transition
probabilities

$$P_{n',(n-1)'} = 1, \quad n > 1$$

$$P_{1',0} = 1, \quad n = 1.$$

The rewards depend only on the state and are given by

$$R(n) = 0, \quad n \ge 0$$

$$R(n') = 1, \quad n > 0.$$


Suppose the initial state is state 1. It is easy to see that under any memoryless
rule the total expected reward will be finite. However the randomized stationary
policy which, when in state n, selects action 1 with probability $a_n$ and
action 2 with probability $1 - a_n$ has an infinite expected return when the $a_n$
are chosen so that

$$\prod_{i=1}^{n} a_i \to 0 \quad \text{as } n \to \infty$$

and

$$\sum_{n=1}^{\infty} \prod_{i=1}^{n} a_i = \infty,$$

for the first condition implies that a primed state will eventually be reached
with probability 1 while the second condition implies that the expected index
of this first primed state is infinite.
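One concrete choice makes Example 1 easy to check by hand. The sketch below assumes $a_n = n/(n+1)$, which is not taken from the text: the probability of still being in an unprimed state after n steps is then 1/(n+1), which tends to 0, while the truncated expected reward grows like the harmonic series. The function names are hypothetical.

```python
from fractions import Fraction

def a(n):
    """Hypothetical choice a_n = n / (n + 1); not taken from the text."""
    return Fraction(n, n + 1)

def survival(n):
    """P(action 1 is chosen in each of states 1..n) = a_1 * ... * a_n."""
    prod = Fraction(1)
    for i in range(1, n + 1):
        prod *= a(i)
    return prod

def truncated_expected_reward(N):
    """E[min(M, N)], where M is the index of the first primed state entered.

    Entering n' earns n in total (one unit in each of n', (n-1)', ..., 1'),
    and E[min(M, N)] = sum_{m=1}^{N} P(M >= m) = sum_{m=0}^{N-1} survival(m).
    """
    return sum(survival(m) for m in range(N))

print(survival(9))                    # telescopes to 1/10
print(truncated_expected_reward(4))   # 1 + 1/2 + 1/3 + 1/4
```

Since the truncated expectation is the Nth harmonic number under this choice, it is unbounded in N, which is the divergence the example relies on.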

The  second  example  shows  that  we  cannot  always  restrict  attention  to  the 
randomized  stationary  policies. 

Example  2: 

The states are given by 1, 2, 3, ..., $\infty$. In state n there are 2 possible
actions having respective transition probabilities

$$P_{n,n+1}(1) = 1, \quad 1 \le n < \infty$$

$$P_{n,1}(2) = a_n = 1 - P_{n,\infty}(2), \quad 1 \le n < \infty.$$

State $\infty$ is an absorbing state, i.e.,

$$P_{\infty,\infty} = 1.$$

The rewards depend only on the state and are given by

$$R(1) = 1$$

$$R(n) = 0, \quad n = 2, 3, \ldots, \infty.$$

The values $a_n$ are chosen to satisfy

$$\prod_{n=1}^{\infty} a_n > 0, \qquad a_n < 1 \ \text{for all } n. \tag{13}$$

Suppose that the initial state is state 1. It is easy to see that under any
randomized stationary policy the number of visits to state 1 is a
geometric random variable with finite mean, hence the total expected return is
finite. However, consider the policy which on its nth return to state 1 chooses
action 1 n times and then chooses action 2. Since, by (13), this policy has a
positive probability of visiting state 1 infinitely often, it has an infinite
expected return.
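The contrast in Example 2 can be made concrete with a specific sequence. The sketch below assumes $a_n = 1 - 1/(n+1)^2$, which satisfies (13) but is not taken from the text, and it counts the initial visit to state 1 as the first visit (an indexing assumption). The function names are hypothetical.

```python
from fractions import Fraction

def a(n):
    """Hypothetical return probabilities a_n = 1 - 1/(n+1)^2, satisfying (13)."""
    return 1 - Fraction(1, (n + 1) ** 2)

def stationary_expected_visits(j):
    """Deterministic stationary policy: action 1 in states < j, action 2 in state j.

    Each excursion returns to state 1 with probability a_j, so the number of
    visits to state 1 is geometric with mean 1 / (1 - a_j), which is finite.
    """
    return 1 / (1 - a(j))

def nonstationary_visit_prob(m):
    """P(at least m visits to state 1) under the nonstationary policy.

    On the kth visit the policy climbs to state k + 1 and then takes
    action 2, returning to state 1 with probability a_{k+1}.
    """
    prob = Fraction(1)
    for k in range(1, m):
        prob *= a(k + 1)
    return prob

def nonstationary_truncated_visits(m):
    """Expected number of visits to state 1 among the first m possible visits."""
    return sum(nonstationary_visit_prob(i) for i in range(1, m + 1))

print(stationary_expected_visits(1))       # finite mean for a stationary rule
print(nonstationary_visit_prob(100))       # stays above 2/3 for every m
print(nonstationary_truncated_visits(30))  # grows at least linearly in m
```

Because every visit probability stays above 2/3 under this choice, the expected number of visits among the first m grows at least like 2m/3, so the total expected return is infinite, while each stationary rule of the form above has a finite mean.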